The aim of this first data analysis is to explore a cancer single cell dataset paired with the spatial data we will see in the next tutorials and try to annotate its cell subpopulations. Having a good annotation at the single cell level helps studying spatial data in multiple ways, in particular to deconvolute the cell types (and their proportions) when the spatial assay does not have single cell resolution (like 10x Genomics Visium: https://cdn.10xgenomics.com/image/upload/v1645041033/analysis-guides/Spatial-AnnotatedSCdata-Illus.png), or when the spatial assay has great resolution but cannot capture the whole transcriptome, and we need to define a panel of genes that is meaningful to our experimental design (such as the probes used by image-based spatial approaches).

Along the way, you can try the function below if you have too many objects in your workspace and start wondering which are taking up a lot of your system’s memory (you see the total currently used memory on the right, under the ‘Environment’ tab, showed as a pie/donut chart when ‘Memory Usage Report…’ is selected). If you want to delete some you don’t need anymore, this will help you prioritize them by size.

memory_by_obj <- function(list, sort_by="size", decreasing=T) {
        obj_sizes<-data.frame()
        for (itm in list) {
        obj_sizes<-rbind(obj_sizes,
        data.frame(name=itm, size=as.numeric(formatC(object.size(get(itm))))))
        }
        obj_sizes[order(obj_sizes[,sort_by], decreasing=decreasing),]
}
Setup
knitr::opts_chunk$set(echo = TRUE)

# Choose and set your working directory for all code chunks for knitr.
# Since my RProject is in my working directory I choose current folder with ".".
knitr::opts_chunk$set(root.dir = ".")

# check the file names in the folder where you downloaded them:
list.files("exercise_material/")
## [1] "10x_human_breast_cancer_gene_panel.csv"                                                                        
## [2] "Chromium_FFPE_Human_Breast_Cancer_Chromium_FFPE_Human_Breast_Cancer_count_sample_filtered_feature_bc_matrix.h5"
## [3] "CytAssist_FFPE_Human_Breast_Cancer_filtered_feature_bc_matrix.h5"                                              
## [4] "CytAssist_FFPE_Human_Breast_Cancer_spatial.tar.gz"
# in 'exercise_material' I have the raw data files
# in 'exercise_results' I will save the results of each tutorial, hence I have a folder called "1"
# for the first tutorial
Libraries
# tested with R version 4.3.3
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 'SeuratObject' was built under R 4.3.0 but the current version is
## 4.3.3; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
## 'SeuratObject' was built with package 'Matrix' 1.6.3 but the current
## version is 1.6.5; it is recomended that you reinstall 'SeuratObject' as
## the ABI for 'Matrix' may have changed
## 
## Attaching package: 'SeuratObject'
## The following object is masked from 'package:base':
## 
##     intersect
library(SeuratDisk) # remotes::install_github("mojaveazure/seurat-disk")
## Registered S3 method overwritten by 'SeuratDisk':
##   method            from  
##   as.sparse.H5Group Seurat
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)
set.seed(42)

Load the single cell dataset:

sc.data <- Seurat::Read10X_h5(filename = "exercise_material/Chromium_FFPE_Human_Breast_Cancer_Chromium_FFPE_Human_Breast_Cancer_count_sample_filtered_feature_bc_matrix.h5",
                      use.names = TRUE, unique.features = TRUE)

# We read the raw data in here. It is a big matrix!

Check the class, structure and dimensions of the dataset (expression x cell barcode (bc) matrix):

sc.data %>% class
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
sc.data[1:3,1:3]
## 3 x 3 sparse Matrix of class "dgCMatrix"
##        AAACAAGCAAACGGGA-1 AAACAAGCAAATAGGA-1 AAACAAGCAAATGACT-1
## SAMD11                  .                  2                  .
## NOC2L                   .                  1                  .
## KLHL17                  .                  .                  .
dim(sc.data)
## [1] 18082 30365
#or all in one:
sc.data %>% str
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:73628041] 13 40 48 108 114 146 171 225 235 261 ...
##   ..@ p       : int [1:30366] 0 812 4787 5265 6369 8926 14684 16486 20536 21331 ...
##   ..@ Dim     : int [1:2] 18082 30365
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:18082] "SAMD11" "NOC2L" "KLHL17" "PLEKHN1" ...
##   .. ..$ : chr [1:30365] "AAACAAGCAAACGGGA-1" "AAACAAGCAAATAGGA-1" "AAACAAGCAAATGACT-1" "AAACAAGCAACAAGTT-1" ...
##   ..@ x       : num [1:73628041] 1 1 1 1 1 1 1 1 3 1 ...
##   ..@ factors : list()

Initialize the Seurat object with the raw (non-normalized) data:

sc <- CreateSeuratObject(counts = sc.data, project = "scFFPE", min.cells = 3)

We will then follow through the standard pre-processing workflow proposed by the Seurat package: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

# Compute percent of counts coming from mitochondrial genes:
sc[["percent.mt"]] <- PercentageFeatureSet(sc, pattern = "^MT-")
# Visualize QC metrics as a violin plot
VlnPlot(sc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.

# NB: 'nFeature_RNA' are the number of genes detected in each cell from the RNA expression,
# 'nCount_RNA' are the number of UMIs (unique molecular identifiers) detected in each cell, i.e. the number of transcripts.
# Use FeatureScatter to visualize relationships between these 3 features:
plot1 <- FeatureScatter(sc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(sc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

# E.g., we see some cells with high mitochondrial percentage also have low counts, and are likely low quality cells

# Check which genes contribute the most to such reads. We can for instance plot the percentage of counts per gene:
# This is an extra QC bit that Seurat doesn't provide but is still interesting to check.

C <- sc@assays$RNA$counts
C <- Matrix::t(Matrix::t(C)/Matrix::colSums(C)) * 100
most_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]
boxplot(t(as.matrix(C[most_expressed, ])), cex = 0.1, las = 1, xlab = "% total count per cell",
        col = (scales::hue_pal())(20)[20:1], horizontal = TRUE)

Subset the dataset: keep cells with mitochondrial fraction ≤ 0.15 and number of genes observed ≥ 500

# To keep the initial Seurat object ('sc'), I save the subset to a new filtered object, 'scf':
scf <- subset(sc, subset = nFeature_RNA >= 500 & percent.mt < 15) # & nFeature_RNA < 5000 # down to 24175 cells, median of 1315 genes

Check again the features for the subsetted cells:

plot3 <- FeatureScatter(scf, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot4 <- FeatureScatter(scf, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot3 + plot4

#To compare the with the previous ones:
(plot1 + plot2) / (plot3 + plot4)

Proceed with the standard Seurat normalization and dimensional reduction:

# Apply a global-scaling normalization method that normalizes the feature expression measurements 
# for each cell by the total expression, multiplies this 
# by a scale factor (10,000 by default), and log-transforms the result
scf <- NormalizeData(scf, normalization.method = "LogNormalize", scale.factor = 10000)
## Normalizing layer: counts
# VST (variance stabilizing transformation) looks at the trend between variance and mean in the data, and then tries to find a strictly monotonous transformation of the data so that this trend is removed.
# In practice, the transformation will approach the logarithm function for high values and the square root function for small values (incl. 0), and smoothly interpolate in-between.
scf <- FindVariableFeatures(scf, selection.method = "vst", nfeatures = 2000)
## Finding variable features for layer counts
# Next, we apply a linear transformation (‘scaling’) that is 
# a standard pre-processing step prior to dimensional reduction techniques like PCA.
all.genes <- rownames(scf)
scf <- ScaleData(scf, features = all.genes)
## Centering and scaling data matrix
# Next we perform Principal Component Analysis on the scaled data.
# By default, only the previously determined variable features are used as input, 
# but can be defined using features argument if you wish to choose a different subset.
scf <- RunPCA(scf, features = VariableFeatures(scf))
## PC_ 1 
## Positive:  COL6A2, COL6A1, COL6A3, COL3A1, COL1A2, VIM, SPARC, FSTL1, DCN, VCAN 
##     LAMB1, C1S, FBN1, LUM, IGFBP7, POSTN, AEBP1, BGN, COL5A2, CALD1 
##     MMP2, COL5A1, C1R, FN1, COL1A1, CXCL12, SFRP2, CCDC80, PDGFRB, COL14A1 
## Negative:  ALDH3B2, ZWINT, SCD, NQO1, FADS2, CLDN4, AGR2, CLDN3, RRM2, ECM1 
##     WDR34, SDR16C5, HIST1H1B, TAT, PVALB, FIBCD1, NRAS, PIP, GPRC5A, FNDC11 
##     HSPA1B, SLC39A6, S100P, UHRF1, CDK1, TYMS, NUSAP1, KIFC1, MCM4, LMNB1 
## PC_ 2 
## Positive:  LAPTM5, SRGN, CIITA, LTB, RGS1, CD52, MS4A6A, CYBB, MS4A7, MPEG1 
##     SPI1, CSF1R, CD163, ITGAX, WDFY4, TRBC1, KLRK1, CCL5, CTSS, HSPA6 
##     SIGLEC1, CSF2RA, BIRC3, TYROBP, FGL2, CR1, C1QA, ADAP2, GPR34, C1QB 
## Negative:  TPM1, CALD1, COL18A1, IGFBP4, COL6A2, SPARC, COL6A1, TIMP1, MYL9, ACTN1 
##     FN1, MGP, COL1A1, THBS1, FSTL1, GSN, TAGLN, COL4A2, COL1A2, AEBP1 
##     PHLDB1, IGFBP7, TIMP3, LAMB1, C1R, LTBP2, C1S, COL3A1, BGN, TACSTD2 
## PC_ 3 
## Positive:  KRT5, KRT17, KRT14, TRIM29, KRT15, KRT6B, CDH3, ITGB8, KRT23, TACSTD2 
##     KLK5, COL17A1, GABRP, MYH14, IRX1, PROM1, TNS4, SFRP1, CEACAM6, DSC3 
##     ITGA3, SCNN1A, ANXA3, SERPINA3, SOX10, CALML3, CD24, CRYAB, VTCN1, MIA 
## Negative:  COL6A3, COL3A1, COL1A2, LUM, DCN, ANLN, HJURP, TOP2A, COL5A2, VCAN 
##     NUSAP1, POSTN, RRM2, CDK1, KIFC1, FBN1, THBS2, CIT, SFRP2, ZWINT 
##     MKI67, TROAP, AURKB, FOXM1, NCAPG, COL1A1, SPAG5, ASPM, TPX2, KIF23 
## PC_ 4 
## Positive:  LTB, IL32, TRBC1, IKZF3, KLRK1, NELL2, CCL5, PIM2, MZB1, COL16A1 
##     POU2AF1, FCRL5, TIGIT, RIC3, COL7A1, AQP3, ZNF683, CD79A, CCR7, DERL3 
##     CYP21A2, MXRA8, TNFRSF17, IGKC, TSHZ2, JCHAIN, FBXO32, ABCA10, KIAA1755, CRACR2A 
## Negative:  MS4A6A, MS4A7, CSF1R, C1QA, CD163, C1QC, STAB1, C1QB, CD68, TYROBP 
##     FCER1G, CD14, MSR1, SPI1, ADAP2, CTSB, SLCO2B1, CYBB, LYZ, SIGLEC1 
##     FGL2, PDK4, TGFBI, MS4A4A, GPR34, DAB2, FPR3, KCTD12, LGMN, IFI30 
## PC_ 5 
## Positive:  CDH5, VWF, ADGRL4, PLVAP, FLT1, EMCN, CLEC14A, PTPRB, RAMP3, KDR 
##     EGFL7, ECSCR, CALCRL, ARHGEF15, DIPK2B, RAMP2, MMRN2, ROBO4, CD34, AFAP1L1 
##     TIE1, SLCO2A1, TMEM255B, ADGRF5, ADCY4, ESAM, SHANK3, ID1, HECW2, PODXL 
## Negative:  LUM, DCN, COL6A3, THBS2, SFRP2, C3, POSTN, COL1A1, COL1A2, C1S 
##     COL3A1, FBLN1, AEBP1, COL14A1, CDH11, LAMA2, MXRA5, PDGFRA, CCDC80, CTSK 
##     COL5A1, VCAN, COL6A1, MXRA8, COL16A1, ISLR, COL5A2, C1R, ADAM12, PODN
# Get an idea of how much percentage of variance is explained by each 
# of the first 50 principal components by an elbow plot:
ElbowPlot(scf, ndims = 50)

# Let's use all 50 PCs for the downstream analyses:

# Compute the shared nearest neighbor (SNN) graph on the dimensionally reduced (PCA) data
# For more details, see: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html#cluster-the-cells
scf <- FindNeighbors(scf, dims = 1:50)
## Computing nearest neighbor graph
## Computing SNN
# Group cells into clusters; try different resolutions to end up with 17 clusters
scf <- FindClusters(scf, resolution = 0.25)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 28479
## Number of edges: 1177641
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9522
## Number of communities: 17
## Elapsed time: 12 seconds
# check number of clusters and number of assigned cells to each of those with:
table(scf$seurat_clusters) # same as: table(scf$RRNA_snn_res.0.25)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 4572 3551 3403 3270 3230 2600 1990 1207 1150 1084  954  652  271  195  193  117 
##   16 
##   40
# To subcluster only a specific cluster, you can subset its cells and reprocess them from scratch, 
# or you can try use:
# https://satijalab.org/seurat/reference/findsubcluster

#Run non-linear dimensional reduction (UMAP or tSNE)
#scf <- RunTSNE(scf, seed.use=42)
scf <- RunUMAP(scf, dims=1:50, verbose = T, seed.use = 42)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 12:54:40 UMAP embedding parameters a = 0.9922 b = 1.112
## 12:54:40 Read 28479 rows and found 50 numeric columns
## 12:54:40 Using Annoy for neighbor search, n_neighbors = 30
## 12:54:40 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 12:54:53 Writing NN index file to temp file /var/folders/gy/y0pz67j11gn0km9nykrrdljr0000gs/T//Rtmp9tH5uk/filecaf1413f2ae9
## 12:54:53 Searching Annoy index using 1 thread, search_k = 3000
## 12:55:19 Annoy recall = 100%
## 12:55:20 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 12:55:22 Initializing from normalized Laplacian + noise (using RSpectra)
## 12:55:24 Commencing optimization for 200 epochs, with 1355660 positive edges
## 12:56:03 Optimization finished
# Plot the identified clusters on top of the UMAP representation:
DimPlot(scf, reduction = "umap", group.by = "seurat_clusters", label=T)

#DimPlot(scf, reduction = "tsne")
# save this Seurat object:
save(scf, file = "exercise_results/1/seurat_scf.RData")
#or SeuratDisk::SaveH5Seurat(scf, "scf.h5Seurat")
# Be careful with h5 files, on a network location you might have locking issues

# to load it again:
# load("r_data/seurat_scf.RData")

Check the top N most highly variable genes:

# Identify the 5 most highly variable genes
topN <- head(VariableFeatures(scf), 5)

# plot the variable features with and without labels
plot1 <- VariableFeaturePlot(scf)
plot2 <- LabelPoints(plot = plot1, points = topN, repel = TRUE, xnudge = 0, ynudge =0)
plot1 / plot2

What type of cells can you already expect to find, based on these few genes?

Explore gene expression across tissues, eg at: https://cellxgene.cziscience.com/gene-expression

Three good atlases as reference are also:

Azimuth (https://azimuth.hubmapconsortium.org/) is also a handy web application that allows online annotation of datasets with < 100’000 cells (and lighter than 1GB).

Sadly though, it has no reference for breast tissue/cancer presently.

Explore the marker genes of each cluster:

# DEG (Differentially Expressed Genes)
# NB, 'limma' package will be automatically installed through 'BiocManager' if missing:
# install.packages('BiocManager')
# BiocManager::install('limma')

# [This will take a while] Find markers for every cluster compared to all remaining cells, report only the positive ones:

# You can also use it with option 'test.use = "MAST"', a faster and better approach to DEG in single cell, to do so install it from BiocManager. See: https://github.com/RGLab/MAST

all.markers <- FindAllMarkers(scf, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25, verbose=F)
# mast.all.markers <- FindAllMarkers(scf, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25, test.use = "MAST", verbose=F)

# optionally, save the markers
# save(all.markers, file="r_data/all.markers.sc.RData")
# save(mast.all.markers, file="r_data/mast.all.markers.sc.RData")

all.markers %>%
  dplyr::group_by(cluster) %>%
  dplyr::slice_max(n = 2, order_by = avg_log2FC)
# Plot gene expression across clusters
VlnPlot(scf, features = c("IL7R", "CD2"))

FeaturePlot(scf, features = c("IL7R", "CD2"), order=T)

Use a pre-defined panel of marker genes for breast cancer to help you define the cell types (very broad classification).
# Download 10x Genomics' gene panel for human breast cancer (280 genes) and save it with the other files (I save it in the 'exercise_material' folder I created before):

download.file(url="https://cf.10xgenomics.com/supp/xenium/panels/hBreast_v1.csv", destfile = "exercise_material/10x_human_breast_cancer_gene_panel.csv")

# read it in R (only columns 1 and 5 needed) and name the columns
panel <- read.csv("exercise_material/10x_human_breast_cancer_gene_panel.csv", header = F)[,c(1,5)] %>% setNames(., c("gene", "annotation"))

# Get a view of it
str(panel)
## 'data.frame':    280 obs. of  2 variables:
##  $ gene      : chr  "ABCC11" "ACTA2" "ACTG2" "ADAM9" ...
##  $ annotation: chr  "Breast cancer" "Smooth muscle cells" "Breast myoepithelial cells" "Breast glandular cells" ...
# or
# View(panel)

# Check if any gene in the panel is not present in the dataset:
panel$gene[!panel$gene %in% rownames(scf)]
## [1] "AKR1C1"   "ANGPT2"   "APOBEC3B" "TPSAB1"
# Do we still have other genes in the panel for the same cell types? How many?
panel %>% filter(annotation == panel$annotation[!panel$gene %in% rownames(scf)]) %>% select(annotation) %>% table
## annotation
##       Breast cancer   Endothelial cells Smooth muscle cells 
##                   7                   2                   4
# Then let's subset the panel for the genes in the dataset, and use that to define an approximate score for each cell type:
panel <- panel %>% filter(gene %in% rownames(scf))

# extract the list of cell types in the annotation column
panel_cell_types <- panel$annotation %>% unique

# Explore the 'AddModuleScore' function from Seurat, and try to apply it for the annotation of 'Adipocytes'
# Note, I return the annotated scf object to a 'tmp' object for the moment being.
tmp <- AddModuleScore(scf, features = (panel %>% filter(annotation == "Adipocytes") %>% select(gene) %>% unlist), name = "Adipocytes_score")

# Plot the score. Note: Seurat adds a number (here '1') at the end of the name we've chosen for the score, 
# to avoid ambiguous duplicates when running multiple modules with the same name
FeaturePlot(tmp, features = "Adipocytes_score1", order=T)

# Repeat the same for the all the annotated cell types in the panel, in an automated way.
# As AddModuleScore accepts a list of vectors to score multiple expression programs at once, where 'each entry should be a vector of feature names' (i.e. of gene names),
# reshape the panel dataframe as a named list of vectors, one vector for each cell type annotation:
annotation_list <- panel %>% select(gene, annotation) %>% group_by(annotation) %>% # this groups each gene to the cell type annotation
                            group_map(~.$gene) %>% # this creates the list of vectors, where each vector is the set of genes for a cell type
                            setNames(., panel %>% group_keys(annotation) %>% pull(1)) # this names each vector in the list with the matching cell type label
## Warning: The `...` argument of `group_keys()` is deprecated as of dplyr 1.0.0.
## ℹ Please `group_by()` first
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# The warning about group_keys() can be ignored, check the output:
str(annotation_list)
## List of 19
##  $ Adipocytes                : chr [1:21] "ADH1B" "ADIPOQ" "ANKRD29" "CAVIN2" ...
##  $ B cells                   : chr [1:26] "ANKRD28" "APOBEC3A" "BANK1" "C2orf42" ...
##  $ Breast cancer             : chr [1:32] "ABCC11" "AR" "CAV1" "CCND1" ...
##  $ Breast glandular cells    : chr [1:52] "ADAM9" "AGR3" "ALDH1A3" "AQP3" ...
##  $ Breast myoepithelial cells: chr [1:15] "ACTG2" "C6orf132" "DSP" "DST" ...
##  $ Dendritic cells           : chr [1:9] "ADGRE5" "CD83" "CLEC9A" "FCER1A" ...
##  $ Endothelial cells         : chr [1:10] "AKR1C3" "AQP1" "CD93" "ESM1" ...
##  $ Epithelial cells          : chr [1:6] "ANKRD30A" "EPCAM" "KRT14" "KRT8" ...
##  $ Fibroblasts               : chr [1:14] "CCDC80" "DPT" "FBLN1" "IGF1" ...
##  $ Immune cells              : chr "PTPRC"
##  $ Macrophages               : chr [1:26] "AIF1" "APOC1" "BASP1" "C1QA" ...
##  $ Mast cells                : chr [1:2] "CPA3" "CTSG"
##  $ Monocytes                 : chr [1:3] "CD14" "FCGR3A" "S100A8"
##  $ Myeloid cells             : chr "ITGAM"
##  $ NK cells                  : chr [1:8] "CD247" "FCER1G" "GNLY" "KLRD1" ...
##  $ Neutrophils               : chr "CEACAM8"
##  $ Plasma cells              : chr "TNFRSF17"
##  $ Smooth muscle cells       : chr [1:13] "ACTA2" "AVPR1A" "CCL8" "CRISPLD2" ...
##  $ T cells                   : chr [1:35] "CCL5" "CCR7" "CD274" "CD3E" ...
# substitute the spaces with underscores in the cell type labels
names(annotation_list) <- names(annotation_list) %>% stringr::str_replace_all(pattern = " ", replacement = "_")

# To be on the safe side, make a temporary copy of the Seurat object and operate on that:
tmp = scf

tmp <- AddModuleScore(tmp, 
                      features = annotation_list, 
                        name = paste0(names(annotation_list) , "_score")) # add 'score_' at the end of the name of the score


# Observe how the score for each cell has been added to the metadata of the Seurat object
str(tmp@meta.data)
## 'data.frame':    28479 obs. of  25 variables:
##  $ orig.ident                       : Factor w/ 1 level "scFFPE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nCount_RNA                       : num  1145 8116 1393 3811 15982 ...
##  $ nFeature_RNA                     : int  812 3975 1104 2557 5758 1802 4050 795 5865 1143 ...
##  $ percent.mt                       : num  3.58 2.4 1.51 2.28 2 ...
##  $ RNA_snn_res.0.25                 : Factor w/ 17 levels "0","1","2","3",..: 4 5 9 10 6 9 5 1 7 4 ...
##  $ seurat_clusters                  : Factor w/ 17 levels "0","1","2","3",..: 4 5 9 10 6 9 5 1 7 4 ...
##  $ Adipocytes_score1                : num  0.04712 -0.03449 0.2371 -0.00226 0.02595 ...
##  $ B_cells_score2                   : num  0.0725 0.0581 -0.0813 -0.202 -0.1221 ...
##  $ Breast_cancer_score3             : num  0.531 0.256 0.14 0.235 0.298 ...
##  $ Breast_glandular_cells_score4    : num  0.0364 0.1406 -0.0604 0.0887 0.2361 ...
##  $ Breast_myoepithelial_cells_score5: num  -0.0295 -0.0767 -0.0887 0.7036 0.2332 ...
##  $ Dendritic_cells_score6           : num  0.488 0.151 -0.121 -0.232 -0.236 ...
##  $ Endothelial_cells_score7         : num  -0.0758 0.0791 0.1691 -0.1762 -0.1417 ...
##  $ Epithelial_cells_score8          : num  0.459 0.288 -0.37 0.841 0.769 ...
##  $ Fibroblasts_score9               : num  0.3994 -0.0975 0.2097 -0.0559 -0.0655 ...
##  $ Immune_cells_score10             : num  -0.137 -0.492 -0.277 -0.399 -0.462 ...
##  $ Macrophages_score11              : num  0.5503 0.7303 0.0338 -0.1134 -0.2114 ...
##  $ Mast_cells_score12               : num  -0.026 -0.0241 -0.0347 -0.0193 -0.0194 ...
##  $ Monocytes_score13                : num  0.858 0.596 -0.116 -0.171 -0.226 ...
##  $ Myeloid_cells_score14            : num  -0.0228 -0.1453 -0.1387 -0.1159 -0.1682 ...
##  $ NK_cells_score15                 : num  0.211 0.151 -0.115 -0.142 -0.141 ...
##  $ Neutrophils_score16              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Plasma_cells_score17             : num  0 -0.0445 -0.042 -0.0386 -0.0664 ...
##  $ Smooth_muscle_cells_score18      : num  0.2169 0.1377 0.5217 0.0529 -0.0167 ...
##  $ T_cells_score19                  : num  -0.00813 -0.16715 -0.19679 -0.23761 -0.34476 ...
# for aesthetics, we can remove the trailing number Seurat adds to the score label:
names(tmp@meta.data) <- names(tmp@meta.data) %>% stringr::str_replace(pattern = "_score.*", "_score")

# Plot all scores obtained:
FeaturePlot(tmp, features = names(tmp@meta.data) %>% grep(., pattern= "_score", value=T), # select names of the scores from the metadata and pass them as features
                  order=T) # plot cells with highest values on top

# To make the plotting quicker at lower resolution, use the function 'raster=T'
FeaturePlot(tmp, features = names(tmp@meta.data) %>% grep(., pattern= "_score", value=T),
                 order=T, raster= T, ncol=4)

# We can also plot them along with the initial clustering:
# Note, make sure you have the 'patchwork' library loaded to use '+' to put plots together:
FeaturePlot(tmp, features = names(tmp@meta.data) %>% grep(., pattern= "_score", value=T), order=T, raster= T) +
  DimPlot(scf, group.by = "seurat_clusters", label=T)

You can find out more about how Seurat’s AddModuleScore function works in this nicely written post: https://www.waltermuskovic.com/2021/04/15/seurat-s-addmodulescore-function/. The scores for each cell depend on the composition of the dataset, which might not be intuitive at first, and it’s important to remember when you use it to make assumptions about your data.

Once you are satisfied with the characterization of the clusters found, you can annotate them as follows (change the letters in the quotes below, add clusters if you resolved to define more):

tmp <- scf@meta.data %>% select(orig.ident, RNA_snn_res.0.25) %>% 
  dplyr::mutate( celltype = case_match(RNA_snn_res.0.25,
                                              "0" ~ "a",
                                              "1" ~ "b",                                                                                       
                                              "2" ~ "c",                                                                                                
                                              "3" ~ "d",                                                                                
                                              "4" ~ "e",                                                                                   
                                              "5" ~ "f",                                                                                    
                                              "6" ~ "g",                                                                                                   
                                              "7" ~ "h",                                                                                         
                                              "8" ~ "i",                                                                                                  
                                              "9" ~ "j",                                                                               
                                              "10" ~ "k",                                                                                  
                                              "11" ~ "l",                                                                                         
                                              "12" ~ "m",                                                                                           
                                              "13" ~ "n",                                                                                            
                                              "14" ~ "o",                                                                                         
                                              "15" ~ "p",                                                                                                 
                                              "16" ~ "q"
                                              )
                 ) %>% select(celltype) %>% unlist

names(tmp) <- colnames(scf)

# to edit some cell annotation afterwards, rename the label of specific cells such as:
tmp[names(tmp) %in% colnames(scf)[scf$RNA_snn_res.0.25 == 0] ] <- "corrected_celltype_label"

scf <- AddMetaData(scf,
                  metadata = tmp,
                  col.name = 'celltype_annotation'
)

rm(tmp)
# Visualize the clusters with the imported labels:
DimPlot(scf, group.by = "celltype_annotation", label=T, repel=T)